Introduction

A previous study had conducted both gene expression and metabolomics profiling of tissue samples from breast cancer patients (1). This vigenette will highlight the analysis we conduct on the breast cancer data.

Loading in IntLIM and Files

IntLIM, available from Github, can be installed as in the documentation. Once IntLIM is installed, what is necessary is loading in the library. First clear the workspace.

rm(list = ls())
library(IntLIM)

For breast cancer study, both gene expression (available on Gene Expression Omnibus Accession Number: GSE37751) and metabolomics(1) (http://www.jci.org/articles/view/71180/sd/2) data are available online. Much of this data had been processed as previously described (1). Probes from the Affymetrix data not mapping to a gene symbol were removed. Additionally, as with NCI-60 data, only the probe corresponding to the highest mean expression was used for analysis when multiple probes corresponded to a single gene. This resulted in a total of 20,254 genes for 108 patient samples. The Metabolon data did not need to be filtered by coefficient of variation, as there were no technical replicates. The resulting data consisted of 536 metabolites with 132 patient samples.

This data has been formatted for IntLIM. We load in the data as follows. The bc.ambs.csv meta file contains a list of phenotypic data file, metabolite data file, gene expression data file, metabolite meta file, and gene expression meta file. The function OutputStats will give a summary of the NCI-60 data

Note: The data is found in the BC.data.vignette.zip file. This file has to be de-zipped prior to loading.

Load in the breast cancer data using the ReadData() function. ShowStats allow us to

inputData <- IntLIM::ReadData('bc.ambs.csv',metabid='id',geneid='id')
## [1] "CreateMultiDataSet created"
IntLIM::ShowStats(inputData)
##   Num_Genes Num_Metabolites Num_Samples_withGeneExpression
## 1     20254             536                            108
##   Num_Samples_withMetabolomics Num_Samples_inCommon
## 1                          132                  108

From the OutputStats, we find that we have gene expression data involving 20,254 genes with 108 patient samples and metabolite abundance data involving 536 metabolites with 132 patient samples.

Filtering Gene Expression and Metabolite Data

The FilterData() functino is used to filter the data. We remove genes with mean belows below the 10th percentile. Furthermore, we remove metabolites with more than 80% missing values. This results in gene expression data involving 18,228 genes and 108 patient samples and metabolite abundance data involving 379 metabolites and 132 patient samples.

inputDatafilt <- IntLIM::FilterData(inputData,geneperc=0.10, metabmiss = 0.80)
## [1] "No metabolite filtering by percentile is applied"
IntLIM::ShowStats(inputDatafilt)
##   Num_Genes Num_Metabolites Num_Samples_withGeneExpression
## 1     18228             379                            108
##   Num_Samples_withMetabolomics Num_Samples_inCommon
## 1                          132                  108

We can obtain boxplot distributions of the data as follows. This is used to make figures.

IntLIM::PlotDistributions(inputDatafilt, palette = c("black", "black"))

Principal Component Analysis

The principal component analysis is performed on filtered metabolite and gene expression data to obtain visual representations showing how different sub-sections of the data could be grouped into different clusters. Common samples patient samples (either tumor or adjacent non-tumor samples). Blue samples indicate tumor samples and red samples indicate non-tumor samples. Note the clear delineation of samples.

PlotPCA(inputDatafilt, stype = "DIAG", common = F)

Running Linear Model

The linear model is for integrating transcriptomic and metabolomics data is: E(m|g,t) = β1 + β2 g + β3 p + β4 (g:p) + ε where ‘m’ and ‘g’ are log-transformed metabolite abundances and gene levels respectively, ‘p’ is phenotype (cancer type, patient diagnosis, treatment group, etc), ‘(g:p)’ is the association between gene expression and phenotype, and ‘ε’ is the error term that is normally distributed. A statistically significant p-value of the ‘(g:p)’ association term indicates that the slope relating gene expression and metabolite abundance is different from one phenotype compared to another. We run a linear model on tumor (n = 61) and non-tumor samples (n = 47) that included 18,228 genes and 379 metabolites (total of 6,908,412 possible associations and hence models). For genes and metabolites that had standard deviations of 0 in one of the groups, we assign a p-value of NA. The model is run as below by calling RunIntLim(). DistPvalues() allows us to obtain a distribution of p-values for the (g:p) term. The pvalCorrVolcano function allows us to observe how p-values vary with correlation differences.

myres <- IntLIM::RunIntLim(inputDatafilt, stype="DIAG")
## [1] "Running the analysis on"
## 
## NORMAL  TUMOR 
##     47     61 
## [1] "10 % complete"
## [1] "20 % complete"
## [1] "30 % complete"
## [1] "40 % complete"
## [1] "50 % complete"
## [1] "60 % complete"
## [1] "70 % complete"
## [1] "80 % complete"
## [1] "90 % complete"
##    user  system elapsed 
## 275.169  29.123 304.892
IntLIM::DistPvalues(myres)

IntLIM::pvalCorrVolcano(myres, inputDatafilt, diffcorr = 0.5, pvalcutoff = 0.05)

The next step is to process the results of this model by filtering the results of the linear model by FDR-adjusted p-value cutoff (0.10 selected here) for the (g:p) association coefficient and calculate the correlations of the gene-metabolite pairs in each group (tumor and non-tumor) for the filtered results. We further may only interested in results that have an absolute correlation value difference (0.50 selected here). This is done with the ProcessResults function. In addition we also develop a heatmap of the gene-metabolite association correlations for the selected groups.

myres10 <- IntLIM::ProcessResults(myres,  inputDatafilt, diffcorr = 0.50, pvalcutoff = 0.10)
IntLIM::CorrHeatmap(myres10)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
OutputResults(myres10, filename = "bcresults10.csv")
dim(myres10@filt.results)
## [1] 10861     6

We find that we obtain 10,861 correlations. We try to lessen the number by setting a more stringent p-value cutoff of 0.05 as done below.

myres <- IntLIM::ProcessResults(myres,  inputDatafilt, diffcorr = 0.50, pvalcutoff = 0.05)
#colnames(myres@filt.results)[3] <- "NORMAL"
#colnames(myres@filt.results)[4] <- "TUMOR"
IntLIM::CorrHeatmap(myres, top_pairs = 3000)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
OutputResults(myres, filename = "bcresults5.csv")
dim(myres@filt.results)
## [1] 2842    6

From this model we find 2842 gene-metabolite correlations that have an association FDR-adjusted p-value of 0.05 and an absolute value correlation difference of 0.5 or greater. The top pairs are shown below.

corr.table <- myres@filt.results
abs.corrdiff <- abs(myres@filt.results$NORMAL_cor - myres@filt.results$TUMOR_cor)
sort.table <- corr.table[order(-abs.corrdiff),]
sort.table[1:20,]
##                                   metab     gene NORMAL_cor  TUMOR_cor
## 445                   adrenate (22:4n6)   PDLIM4  0.5925755 -0.6262295
## 1035 docosapentaenoate (n3 DPA; 22:5n3) HIST1H3A -0.5074006  0.6666402
## 454                   adrenate (22:4n6) HIST1H3A -0.4935816  0.6429931
## 472                   adrenate (22:4n6)     GLI3  0.5703712 -0.5602327
## 334                   adrenate (22:4n6)    GRIA4  0.5279288 -0.6001586
## 1372                          guanosine   IGFBP4  0.5839669 -0.5426230
## 378                   adrenate (22:4n6)   IGFBP4  0.5797386 -0.5453728
## 1537                            leucine  COL14A1  0.5366559 -0.5864622
## 1032 docosapentaenoate (n3 DPA; 22:5n3)   PDLIM4  0.5124884 -0.6061952
## 416                   adrenate (22:4n6)     FHL2  0.5068810 -0.6087784
## 377                   adrenate (22:4n6)   LRRC48  0.5799699 -0.5343205
## 443                   adrenate (22:4n6)    CMYA5  0.4980918 -0.6144897
## 831    dihomo-linolenate (20:3n3 or n6) HIST1H3A -0.4354764  0.6760973
## 1530                            leucine  CCDC90A -0.5583950  0.5529878
## 337                   adrenate (22:4n6)    OR5P2  0.6450792 -0.4617134
## 2310                         tryptophan  CCDC90A -0.5450971  0.5574828
## 1793                  N-acetylthreonine  COL14A1  0.5946105 -0.5068814
## 764           dihomo-linoleate (20:2n6)   PDLIM4  0.5106383 -0.5902246
## 828    dihomo-linolenate (20:3n3 or n6)   PDLIM4  0.5189639 -0.5794818
## 1916                    oleate (18:1n9)   PDLIM4  0.4768733 -0.6208702
##              Pval  FDRadjPval
## 445  3.707593e-11 0.000255460
## 1035 4.014098e-08 0.007475101
## 454  1.192003e-06 0.021750342
## 472  1.824876e-08 0.006985406
## 334  2.520151e-08 0.007235126
## 1372 7.444980e-06 0.034991327
## 378  2.629422e-07 0.013725152
## 1537 2.711746e-06 0.025259100
## 1032 8.930059e-09 0.006587808
## 416  1.126755e-08 0.006587808
## 377  5.361419e-07 0.017379131
## 443  3.884724e-08 0.007475101
## 831  3.513583e-07 0.015518739
## 1530 4.842388e-08 0.007686314
## 337  1.870480e-07 0.011823810
## 2310 3.504287e-08 0.007475101
## 1793 4.391951e-06 0.029932097
## 764  1.156266e-08 0.006587808
## 828  1.308633e-08 0.006587808
## 1916 4.908400e-08 0.007686314

We can show some example plots of some of these pairs. The first example is the PDLIM4 vs. adrenate (22:4n6). There appears to be an error with the scatterplot.

IntLIM::PlotGMPair(inputDatafilt, stype = "DIAG", geneName = "PDLIM4", metabName = "adrenate (22:4n6)")

We do find that there is one pair with 2-hydroxyglutarate at FDR adjusted p-value of 0.05 and correlation difference of over 0.5 with GPT2.

IntLIM::PlotGMPair(inputDatafilt, stype = "DIAG", geneName = "GPT2", metabName = "2-hydroxyglutarate")

GPT2 and MYC are not linked as was in the Ambs paper.

IntLIM::PlotGMPair(inputDatafilt, stype = "DIAG", geneName = "MYC", metabName = "2-hydroxyglutarate")

Analyzing Clusters

We will cut the heatmap as below. We find 1038 gene-metabolite pairs in the Tumor-correlated cluster and 1804 pairs in the Tumor anti-correlated cluster. The summary of the cluster.1 shows that the gene-metabolite correlations are positive for the tumor samples making this into the tumor-correlated cluster. For cluster.2, the gene-metabolite correlations are negative for tumor samples making it the tumor anti-correlated cluster.

hc.rows<- hclust(dist(myres@filt.results[,c(3,4)]))
ct<- cutree(hc.rows, k=2) 
cluster.1 <- myres@filt.results[which(ct == 1), ]
cluster.2 <- myres@filt.results[which(ct == 2), ]
dim(cluster.1)
## [1] 1038    6
dim(cluster.2)
## [1] 1804    6
summary(cluster.1$TUMOR_cor)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.008884  0.354654  0.443416  0.425089  0.507297  0.689461
summary(cluster.2$TUMOR_cor)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.7224 -0.5464 -0.4522 -0.4204 -0.3205  0.1396

We have 288 unique genes in the Tumor-correlated cluster and 479 unique genes in the Tumor-anti-correlated cluster.

tumor.corr.uniqgene <- unique(cluster.1$gene)
tumor.anti.corr.uniqgene <- unique(cluster.2$gene)

write.csv(tumor.corr.uniqgene, "tumor.corr.uniqgene.bc.csv")
write.csv(tumor.anti.corr.uniqgene, "tumor.anti.corr.uniqgene.bc.csv")
length(tumor.corr.uniqgene)
## [1] 288
length(tumor.anti.corr.uniqgene)
## [1] 479

We also find out how many unique metabolites we have. We have 155 in the tumor-correlated cluster and 188 in the tumor-anti-correlated cluster.

tumor.corr.uniqmetab <- unique(cluster.1$metab)
tumor.anti.corr.uniqmetab <- unique(cluster.2$metab)
length(tumor.corr.uniqmetab)
## [1] 155
length(tumor.anti.corr.uniqmetab)
## [1] 188

We need to fit the metabolite data to Human Metabolome Database (HMDB) IDs

The Metabolon fData file is imported as follows. This is used to convert the list of unique metabolites into HMDB IDs for the IPA analysis.

fData.metab <- read.csv("fData.metab.csv")
hmdb.match <- fData.metab[,c('id', 'HMDB_ID')]
rownames(hmdb.match) <- hmdb.match$id
write.csv(hmdb.match, "hmdb.match.csv")

tumor.corr.uniqmetab.hmdbid <- hmdb.match[as.character(tumor.corr.uniqmetab),]
tumor.corr.hmdb.list <- tumor.corr.uniqmetab.hmdbid[is.na(tumor.corr.uniqmetab.hmdbid$HMDB_ID) == FALSE,'HMDB_ID']

tumor.anti.corr.uniqmetab.hmdbid <- hmdb.match[as.character(tumor.anti.corr.uniqmetab),]
tumor.anti.corr.hmdb.list <- tumor.anti.corr.uniqmetab.hmdbid[is.na(tumor.anti.corr.uniqmetab.hmdbid$HMDB_ID) == FALSE,'HMDB_ID']


write.csv(tumor.corr.hmdb.list, "tumor.corr.uniqmetab.hmdbid.bc.csv")
write.csv(tumor.anti.corr.hmdb.list, "tumor.anti.corr.uniqmetab.hmdbid.bc.csv")

The following 5 genes are found to be in both clusters.

intersect(tumor.corr.uniqgene, tumor.anti.corr.uniqgene)
## [1] "HIST1H4A" "TLN2"     "USP22"    "CNN1"     "FHL2"     "FRZB"
length(intersect(tumor.corr.uniqgene, tumor.anti.corr.uniqgene))
## [1] 6

These metabolites are found in both clusters

intersect(tumor.corr.uniqmetab, tumor.anti.corr.uniqmetab)
##   [1] "1-arachidonoylglycerophosphoethanolamine*" 
##   [2] "1-methylnicotinamide"                      
##   [3] "1-oleoylglycerophosphocholine"             
##   [4] "1-oleoylglycerophosphoethanolamine"        
##   [5] "1-stearoylglycerol (1-monostearin)"        
##   [6] "1-stearoylglycerophosphocholine"           
##   [7] "1-stearoylglycerophosphoethanolamine"      
##   [8] "1-stearoylglycerophosphoinositol"          
##   [9] "10-heptadecenoate (17:1n7)"                
##  [10] "10-nonadecenoate (19:1n9)"                 
##  [11] "2-aminoadipate"                            
##  [12] "2-hydroxypalmitate"                        
##  [13] "2-hydroxystearate"                         
##  [14] "2-linoleoylglycerophosphocholine*"         
##  [15] "2-linoleoylglycerophosphoethanolamine*"    
##  [16] "2-palmitoleoylglycerophosphocholine*"      
##  [17] "2-palmitoleoylglycerophosphoethanolamine*" 
##  [18] "4-hydroxybutyrate (GHB)"                   
##  [19] "5-oxoproline"                              
##  [20] "5,6-dihydrouracil"                         
##  [21] "adenine"                                   
##  [22] "adrenate (22:4n6)"                         
##  [23] "alanine"                                   
##  [24] "arachidonate (20:4n6)"                     
##  [25] "arginine"                                  
##  [26] "aspartate"                                 
##  [27] "behenate (22:0)"                           
##  [28] "betaine"                                   
##  [29] "C-glycosyltryptophan*"                     
##  [30] "caproate (6:0)"                            
##  [31] "cholesterol"                               
##  [32] "choline"                                   
##  [33] "cystathionine"                             
##  [34] "cysteine"                                  
##  [35] "cystine"                                   
##  [36] "cytidine"                                  
##  [37] "dihomo-linoleate (20:2n6)"                 
##  [38] "dihomo-linolenate (20:3n3 or n6)"          
##  [39] "dimethylarginine (SDMA + ADMA)"            
##  [40] "docosadienoate (22:2n6)"                   
##  [41] "docosahexaenoate (DHA; 22:6n3)"            
##  [42] "docosapentaenoate (n3 DPA; 22:5n3)"        
##  [43] "eicosapentaenoate (EPA; 20:5n3)"           
##  [44] "eicosenoate (20:1n9 or 11)"                
##  [45] "ethanolamine"                              
##  [46] "fructose-6-phosphate"                      
##  [47] "gamma-glutamylglutamate"                   
##  [48] "gamma-glutamylglutamine"                   
##  [49] "gamma-glutamylleucine"                     
##  [50] "glucose"                                   
##  [51] "glutamate"                                 
##  [52] "glutamine"                                 
##  [53] "glycerol"                                  
##  [54] "glycine"                                   
##  [55] "glycylglycine"                             
##  [56] "guanine"                                   
##  [57] "guanosine"                                 
##  [58] "histamine"                                 
##  [59] "hypoxanthine"                              
##  [60] "inositol 1-phosphate (I1P)"                
##  [61] "isoleucine"                                
##  [62] "lactate"                                   
##  [63] "laurate (12:0)"                            
##  [64] "leucine"                                   
##  [65] "linoleate (18:2n6)"                        
##  [66] "linolenate [alpha or gamma; (18:3n3 or 6)]"
##  [67] "lysine"                                    
##  [68] "mannose"                                   
##  [69] "margarate (17:0)"                          
##  [70] "methionine"                                
##  [71] "myo-inositol"                              
##  [72] "myristate (14:0)"                          
##  [73] "myristoleate (14:1n5)"                     
##  [74] "N-acetylalanine"                           
##  [75] "N-acetylglucosamine 6-phosphate"           
##  [76] "N-acetylmethionine"                        
##  [77] "N-acetylneuraminate"                       
##  [78] "N-acetylserine"                            
##  [79] "N-acetylthreonine"                         
##  [80] "N1-methyladenosine"                        
##  [81] "N2,N2-dimethylguanosine"                   
##  [82] "N6-acetyllysine"                           
##  [83] "oleate (18:1n9)"                           
##  [84] "ornithine"                                 
##  [85] "palmitate (16:0)"                          
##  [86] "palmitoleate (16:1n7)"                     
##  [87] "pantothenate"                              
##  [88] "phenylalanine"                             
##  [89] "phosphate"                                 
##  [90] "phosphoethanolamine"                       
##  [91] "proline"                                   
##  [92] "pseudouridine"                             
##  [93] "ribose"                                    
##  [94] "serine"                                    
##  [95] "sphingosine"                               
##  [96] "stearate (18:0)"                           
##  [97] "stearidonate (18:4n3)"                     
##  [98] "stearoylcarnitine"                         
##  [99] "threonine"                                 
## [100] "tryptophan"                                
## [101] "tyrosine"                                  
## [102] "uracil"                                    
## [103] "urate"                                     
## [104] "uridine"                                   
## [105] "valine"                                    
## [106] "X - 10419"                                 
## [107] "X - 11644"                                 
## [108] "X - 11725"                                 
## [109] "X - 11866"                                 
## [110] "X - 12100"                                 
## [111] "X - 12442"                                 
## [112] "X - 12627"                                 
## [113] "X - 12660"                                 
## [114] "X - 12776"                                 
## [115] "X - 12856"                                 
## [116] "X - 12990"                                 
## [117] "X - 13134"                                 
## [118] "X - 13516"                                 
## [119] "X - 13543"                                 
## [120] "X - 14590"                                 
## [121] "X - 14600"                                 
## [122] "X - 14606"                                 
## [123] "X - 14613"                                 
## [124] "X - 15161"                                 
## [125] "X - 15165"                                 
## [126] "X - 15173"                                 
## [127] "X - 3094"                                  
## [128] "X - 4051"                                  
## [129] "X - 4523"                                  
## [130] "X - 8889"                                  
## [131] "X - 8988"
length(intersect(tumor.corr.uniqmetab, tumor.anti.corr.uniqmetab))
## [1] 131

The list of unique genes and metabolites for each cluster are used to conduct a pathway enrichment analysis using Ingenuity Pathway Analysis (IPA) (https://www.qiagenbioinformatics.com/products/ingenuity-pathway-analysis/).

Reference

  1. Terunuma A, Putluri N, Mishra1 P, Mathé EA, Dorsey TH, Yi M, Wallace TA, Issaq HJ, Zhou M, Killian JK, Stevenson HS, Karoly ED, Chan K, Samanta S, Prieto D, Hsu TY.T., Kurley SJ, Putluri V, Sonavane R, Edelman DC, Wulff J, Starks AM, Yang Y, Kittles RA, Yfantis HG, Lee DH, Ioffe OB, Schiff R, Stephens RM, Meltzer PS, Veenstra TD, Westbrook TF, Sreekumar A, and Stefan Ambs S. MYC-driven 2-hydroxyglutarate associates with poor prognosis in breast cancer. J Clin Invest. 2014 Jan 2;124(1):398-412.